Description

[0001] The present invention concerns a search system for information retrieval, particularly information stored in
the form of text, wherein a text T comprises words and/or symbols s and sequences S thereof, wherein the information
retrieval takes place with a given or varying degree of matching between a query Q, wherein the query Q comprises
words and/or symbols q and sequences P thereof, and retrieved information R comprising words and/or symbols and
sequences thereof from the text T, wherein the search system comprises a data structure for storing at least a part of
the text T, and a metric M which measures the degree of matching between the query Q and retrieved information R,
and wherein the search system implements search algorithms for executing a search, particularly a full text search on
the basis of keywords kw; and a method in a search system for information retrieval, particularly information stored in
the form of text, wherein a text T comprises words and symbols s and sequences S thereof, wherein the information
retrieval takes place with a given or varying degree of matching between a query Q, wherein the query Q comprises
words and/or symbols q and sequences P thereof, and retrieved information R comprising words and/or symbols and
sequences thereof from the text T, wherein the search system comprises a data structure for storing at least a part of
the text T, and a metric M which measures the degree of matching between the query Q and retrieved information R,
and wherein the search system implements search algorithms for executing a search, particularly a full text search on
the basis of keywords kw, wherein the information in the text T is divided into words s and word sequences S, the words
being substrings of the entire text separated by word boundary terms and forming a sequence of symbols, and wherein
each word is structured as a sequence of symbols.
[0002] The invention also concerns the use of the search system. 

[0003] A tremendous amount of information in various fields of human knowledge is collected and stored in computer
memory systems. As computer memory systems are increasingly linked in publicly available data communication
networks, there has been an increasing effort to develop systems and methods for searching and retrieving information
for public or personal use. Present search methods for data have, however, limitations that seriously reduce the
possibility of efficiently retrieving and using information stored in this manner.

[0004] Information may be stored in the form of different data types, and in the context of information search and
retrieval it is useful to distinguish between dynamic data and static data. Dynamic data change often and
continuously, so that the set of valid data varies all the time, while static data change very seldom or never.
For instance, economic data such as stock values, or meteorological data, are subject to very quick changes
and are hence dynamic. On the other hand, books and documents in archival storage are usually permanent and hence
static data. The volatility of the data relates to how long the information is valid, and it has some
bearing upon how the information should be searched and retrieved. Large volumes of data require some structure in
order to facilitate searching, but the time cost of building such structures must not be higher than the time the data are
valid. The cost of building a structure depends on the data volume, and hence the building of data structures for
searching the information should take both the data volume and the volatility into consideration. The information
collected is stored in databases, and these may be structured or unstructured. Moreover, the databases may contain
several types of documents, including compound documents which contain images, video, sound and formatted or
annotated text. Structured databases in particular are usually furnished with indexes in order to facilitate searching and
retrieving the data. The growth of the World Wide Web (WWW) offers a steadily growing collection of compound and
hyperlinked documents. A great many of these are not collected in structured databases, and no indexes facilitating
rapid searching are available. However, the need for searching documents in the World Wide Web is obvious, and as
a result a number of so-called search engines have been developed, enabling searching of at least parts of the
information in the World Wide Web.

[0005] A search engine is commonly understood to be one or more tools for searching and retrieving information.
In addition to the search system proper, a search engine also contains an index, for instance comprising text from a
large number of uniform resource locators (URLs). Examples of such search engines are AltaVista, HotBot with Inktomi
technology, Infoseek, Excite and Yahoo. All these offer facilities for performing search and retrieval of information in
the World Wide Web. However, their speed and efficiency by no means match the huge amount of information
available on the World Wide Web, and hence the search and retrieval efficiency of these search engines leaves much
to be desired.

[0006] Searching a large collection of text documents can usually be done with several query types. The most
common query type is matching and variants of this. By specifying a keyword or set of keywords that has to be present in
the queried information, the search system retrieves all documents that fulfil this requirement. The basic search method
is based on so-called single keyword matching: the keyword p is searched for and all documents containing this word
are retrieved. It is also possible to search for a keyword prefix p_i, in which case all documents where this prefix is
present in any keyword will be retrieved. Instead of searching with single keywords, the search is sometimes based
on so-called exact phrase matching, where the search uses several single keywords in a particular sequence. As is well
known to persons skilled in the art, the exact matching of keyword phrases in many search systems may be done with
the use of Boolean operators, for instance operators such as AND, OR, and NOT, which allow a filtering of
the information; e.g. an AND phrase returns all documents containing both keywords linked by the AND
operator. A NEAR operator has also been used for returning just the documents in which the matching keywords
are located "near" to each other in the document text. In many structured databases the documents contained
in the database have been annotated, e.g. provided with fields which denote certain parts or types of information in
the document. This allows the search for matches in only parts of the documents and is useful when the type of queried
information is known in advance.

[0007] When searching in text documents the data are structured and most likely present in some natural language,
like English, Norwegian etc. When searching for documents with a certain context it is possible to apply proximity
metrics for matching keywords or phrases that match the query approximately. Allowing errors in keywords and phrases
is a common method for proximity; using a thesaurus is another common method. A proximity search requires only
that there shall be a partial match between the information retrieved and the query. International published application
WO96/00945 titled "Variable length data sequence matching method and apparatus" (Doringer & al.), which has been
assigned to International Business Machines, Corp., discloses the building, maintenance and use of a database with
a trie-like structure for storing entries and retrieving at least a partial match, preferably the longest partial match or all
partial matches of a search argument (input key) from the entries.

[0008] In order to further illuminate the general prior art, mention can be made of international published patent
application WO92/15954 (Kimball & al., assigned to Red Brick System, U.S.A.) and US patent no. 5 627 748 (Baker
& al., assigned to Lucent Technologies, Inc., U.S.A.), both disclosing data structures in the form of suffix trees for
searching/matching in a square matrix. Neither of these two publications discloses anything beyond a regular suffix tree,
except for the use of a linked list during matching, and neither teaches or suggests approaches to limit the search space
when searching for approximate matches. However, such approaches would be most desirable when applying data
structures based on suffix trees to searching, particularly for approximate matches in extremely large document
collections, such as may be found on the World Wide Web.

[0009] The main object of the present invention is thus to provide a search system and a method for fast and efficient
search and retrieval of information in large volumes of data. Particularly it is an object of the present invention to provide
a search system suited for implementing search engines for searching of information systems with distributed large
volume data storage, for instance the Internet. It is to be understood that the search system according to the invention
shall by no means be limited to searching and retrieving information stored in the form of alphanumeric symbols, but
may equally well be applied to searching and retrieving information stored in the form of digitalized images and graphic
symbols, as the word text used herein may also be interpreted as images when these are represented wholly or partly as
sets of symbols. It is also to be understood that the search system according to the invention can be implemented as
software written in a suitable high-level language on commercially available computer systems, but it may also be
implemented in the form of a dedicated processor device for searching and retrieving information of the aforementioned
kind.

[0010] The above-mentioned objects and advantages are realized according to the invention with a search system
which is characterized in that the data structure comprises a tree structure in the form of a non-evenly spaced sparse
suffix tree ST(T) for storing suffixes of words and/or symbols s and sequences S thereof in the text T, that the metric
M comprises a combination of an edit distance metric D(s,q) for an approximate degree of matching between words
and/or symbols s;q in respectively the text T and a query Q and an edit distance metric D_ws(S,P) for an approximate
degree of matching between sequences S of words and/or symbols s in the text T and a query sequence P of words
and/or symbols q in the query Q, the latter edit distance metric including weighting cost functions for edit operations
which transform a sequence S of words and/or symbols s in the text T into the sequence P of words and/or symbols
q in the query Q, the weighting taking place with a value proportional to a change in the length of the sequence S upon
a transformation or dependent on the size of the words and/or symbols s;q in sequences S;P to be matched, that the
implemented search algorithms comprise a first algorithm for determining the degree of matching between words and/
or symbols s;q in the suffix tree representation of respectively the text T and a query Q, and a second algorithm for
determining the degree of matching between sequences S;P of words and/or symbols s;q in the suffix tree
representation of respectively the text T and the query Q, said first and/or second algorithms searching the data structure
with queries Q in the form of either words, symbols, sequences of words or sequences of symbols or combinations thereof,
such that information R is retrieved on the basis of query Q with a specified degree of matching between the former
and the latter, and that the search algorithms optionally also comprise a third algorithm for determining exact matching
between words and/or symbols s;q in the suffix tree representation of respectively the text T and the query Q and/or
a fourth algorithm for determining exact matching between sequences S;P of words and/or symbols s;q in the suffix
tree representation of respectively the text T and the query Q, said third and/or fourth algorithms searching the data
structure with queries Q in the form of either words, symbols, sequences of words, or sequences of symbols or
combinations thereof, such that information R is retrieved on the basis of the query Q with an exact matching between the
former and the latter.



[0011] In an advantageous embodiment of the search system according to the invention the suffix tree ST(T) is a
word-spaced sparse suffix tree SST_ws(T), comprising only a subset of the suffixes in the text T.

[0012] Preferably the word-spaced sparse suffix tree SST_ws(T) is then a keyword-spaced sparse suffix tree SST_kws(T).

[0013] In further advantageous embodiments of the search system according to the invention the first algorithm for
detecting the degree of keyword matching in a keyword-spaced sparse suffix tree SST_kws(T) is implemented as
disclosed by dependent claim 4, the second algorithm for determining the degree of sequence matching in a
keyword-spaced sparse suffix tree SST_kws(T) is implemented as disclosed by dependent claim 5, whereby a subroutine of the
second algorithm preferably is implemented as disclosed by dependent claim 6, the third algorithm for determining an
exact keyword matching in a keyword-spaced sparse suffix tree SST_kws(T) is implemented as disclosed by dependent
claim 7, and finally the fourth algorithm for determining an exact keyword sequence matching in a keyword-spaced
sparse suffix tree SST_kws(T) is implemented as disclosed by dependent claim 8.

[0014] The above-mentioned objects and advantages are also realized according to the invention with a method
which is characterized by generating the data structure as a word-spaced sparse suffix tree SST_ws(T) of a text T for
representing all the suffixes starting at a word separator symbol in the text T, storing sequence information of the words
s in the text T in the word-spaced sparse suffix tree SST_ws(T), generating a combined edit distance metric M comprising
an edit distance metric D(s,q) for words s in the text T and a query word q in a query Q and a word-size dependent
edit distance metric D_ws(S,P) for sequences S of words s in the text T and a sequence P of words q in the query Q,
the edit distance metric D_ws(S,P) being the minimum sum of costs for edit operations transforming a sequence S into
the sequence P, the minimum sum of costs being the minimum sum of cost functions for each edit operation weighted
by a value proportional to the change in the total length of the sequence S or by the ratio of the current word length
and average word length in the sequences S;P, and determining the degree of matching between words s;q by
calculating the edit distance D(s,q) between the words s of the retrieved information R and the word q of a query Q, or in
case the words s,q are more than k errors from each other, determining the degree of matching between the word
sequences S_R; P_Q of retrieved information R and a query Q respectively by calculating the edit distance D_ws(S_R,P_Q)
for all matches.

[0015] Advantageously the method according to the invention additionally comprises weighting an edit operation
which changes a word s into a word q with a parameter for the proximity between the characters of the words s;q, thus
taking the similarity of the words s;q into account when determining the cost of the edit operation in question.

[0016] In an advantageous embodiment of the method according to the invention the number of matches is limited
by calculating the edit distance D_ws(S_R,P_Q) for a restricted number of words in the query word sequence P_Q.
[0017] In another advantageous embodiment of the method according to the invention the edit distance D(s,q)
between a word s and a word q is defined recursively and calculated by means of a dynamic programming procedure;
and the edit distance D_ws(S,P) between a sequence S and a sequence P is correspondingly recursively defined and
calculated by means of a dynamic programming procedure.

[0018] According to the invention the above-mentioned objects and advantages are also realized with the use of the 
search system according to the invention in an approximate search engine. 

[0019] The search system and the method according to the invention shall now be discussed in greater detail in the
following with reference to the accompanying drawing figures, of which

fig. 1 shows an example of a suffix tree,

fig. 2 examples of word-spaced sparse suffix trees as used with the present invention,

fig. 3 an example of a so-called PATRICIA trie as known in prior art,

fig. 4 a further example of a word-spaced sparse suffix tree as used with the present invention,

fig. 5 an example of explicitly stored word sequence information as used with the present invention,

fig. 6 a leaf node structure as used with the present invention, and

fig. 7 schematically the structure of a search engine with the search system according to the present invention.

[0020] The search system according to the invention consists essentially of three parts, namely the data structure,
the metrics for approximate matching and the search algorithm. When full text retrieval is the target, as will essentially
be the case with the search system according to the present invention, the entire data set which shall be retrievable
will be stored in a data structure which supports a high query performance.
[0021] The basic concepts underlying the present invention shall first be discussed in some detail. Stored information
in the form of text T is divided into words s and word sequences S. Words are substrings of the entire text separated
by word boundary terms. The set of word boundary terms is denoted BT_word. A common set of word boundary terms
could be the set {' ', '\t', '\n', '\0', '?'}, where \t denotes a tab character, \n denotes a linefeed character and \0
denotes an end-of-document indicator. In connection with the following description of the present invention, some
definitions concerning strings and sequences will be useful.

Definition 1 : String 

[0022] A string is a sequence of symbols taken from an alphabet, such as the ASCII characters. The length of
a string x is the number of instances of symbols or characters comprising the string, and is denoted |x|. If x has the length
m the string may also be written as x_1 x_2 ... x_i ... x_m, where x_i represents the ith symbol in the string.
[0023] A substring of x is a string given by a contiguous group of symbols within x. Thus, a substring may be obtained
from x by deleting one or more characters from the beginning or the end of the string.


Definition 2: Substring, suffix and prefix 

[0024] A substring of x is a string x_i^j = x_i x_{i+1} ... x_j for some 1 ≤ i ≤ j ≤ n. The string x_i = x_i^n = x_i ... x_n
is a suffix of string x and the string x^j = x_1^j = x_1 x_2 ... x_j is a prefix of string x.
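
The index conventions of Definition 2 can be illustrated with a minimal Python sketch. The helper names `substring`, `suffix` and `prefix` are hypothetical and chosen only for this illustration; the 1-based, inclusive indices follow the definition, while Python itself slices with 0-based, half-open indices.

```python
def substring(x: str, i: int, j: int) -> str:
    """Return x_i ... x_j using 1-based, inclusive indices (1 <= i <= j <= |x|)."""
    if not 1 <= i <= j <= len(x):
        raise ValueError("indices must satisfy 1 <= i <= j <= |x|")
    return x[i - 1:j]


def suffix(x: str, i: int) -> str:
    """Return the suffix x_i ... x_n that starts at the ith symbol."""
    return substring(x, i, len(x))


def prefix(x: str, j: int) -> str:
    """Return the prefix x_1 ... x_j that ends at the jth symbol."""
    return substring(x, 1, j)


if __name__ == "__main__":
    x = "structure"
    print(substring(x, 2, 4), suffix(x, 7), prefix(x, 3))
```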
[0025] Also the notion of a word sequence will be used.

Definition 3: Word sequence 

[0026] A word sequence is a sequence of separated, consecutive words. A word sequence S = s_1, s_2, ..., s_n consists
of n single words (or strings) s_1, s_2, up to s_n.

[0027] Word sequences are delimited by sequence boundary terms. The set of sequence boundary terms is denoted
BT_seq. A common set of sequence boundary terms could be the set {'\0'}, where \0 indicates an end-of-document marker.
[0028] The concept of approximate word matching can be described as follows.

[0029] Given a string s = s_1 s_2 ... s_n and a query term q = q_1 q_2 ... q_m, the task is to find all occurrences of q in s
that are at most k errors away from the original query term q. A proximity metric determines how to calculate the errors
between q and a potential match s_i ... s_j.

[0030] A common metric for approximate word matching is the Levenshtein distance or edit distance (V.I. Levenshtein,
"Binary codes capable of correcting deletions, insertions, and reversals", (Russian) Doklady Akademii nauk SSSR,
Vol. 163, No. 4, pp. 845-8 (1965); also Cybernetics and Control Theory, Vol. 10, No. 8, pp. 707-10 (1966)). This metric
is defined as the minimum number of edit operations needed to transform one string into another. An edit operation is
given by any rewrite rule, for instance:

• (a→ε), deletion
• (ε→a), insertion
• (a→b), change

[0031] Let p and m be two words of size i and j, respectively. Then D(i,j) denotes the edit distance between the ith
prefix of p and the jth prefix of m. The edit distance can then recursively be defined as:

$$D(i,0) = D(0,i) = i$$

$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 \\ D(i,j-1) + 1 \\ D(i-1,j-1) + d(i,j) \end{cases} \qquad (1)$$

where

$$d(i,j) = \begin{cases} 0 & \text{if } p_i = m_j \\ 1 & \text{otherwise} \end{cases}$$
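
The recursive definition of the edit distance is usually evaluated bottom-up with dynamic programming. The following is a minimal sketch of that standard procedure, not the patent's own implementation; the function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(p: str, m: str) -> int:
    """Edit distance between words p and m with unit costs for
    deletion, insertion and change, filled in row by row."""
    rows, cols = len(p) + 1, len(m) + 1
    # D[i][j] is the distance between the ith prefix of p and the jth prefix of m.
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i  # delete all i symbols of the prefix of p
    for j in range(cols):
        D[0][j] = j  # insert all j symbols of the prefix of m
    for i in range(1, rows):
        for j in range(1, cols):
            d = 0 if p[i - 1] == m[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + d,  # change (free when the symbols agree)
            )
    return D[rows - 1][cols - 1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

The table has (|p|+1)(|m|+1) entries, so the running time is proportional to the product of the two word lengths.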
[0032] It is also possible to define an approximate matching on the level of words in a word sequence and this can
be described as follows.

[0033] Given is a text T consisting of the n words w_1, w_2, ..., w_n, where each of the words is a string of characters. A
sequence pattern P consists of the m words p_1, p_2, ..., p_m. The sequence pattern P is said to have an approximate
occurrence in T if the sequence p_1, p_2, ..., p_m differs with at most k errors from a sequence w_i, w_{i+1}, ..., w_j for
some i, j such that 1 ≤ i ≤ j ≤ n. Again, a proximity metric determines how to calculate the number of errors between
the two sequences.

[0034] A text that shall be retrieved in a search system must be indexed in a manner which facilitates searching the
data. Consequently the data structure is a kernel data structure of the search system according to the present invention
and is based on so-called suffix trees and particularly a sparse suffix tree. These two kinds of structures shall be defined
in the following. A suffix tree ST(T) is a tree representation of all possible suffixes in the text T. All unary nodes in a suffix
tree ST(T) are concatenated with their child to create a compact variant.
[0035] Fig. 1 shows the suffix tree for the text T = "structure".

[0036] Even more particularly the present invention is based on sparse suffix trees. These were introduced by J.
Kärkkäinen & E. Ukkonen, in "Sparse Suffix Trees", Proceedings of the Second Annual International Computing and
Combinatorics Conference (COCOON '96), Springer Verlag, pp. 219-230, which again was based on ideas published
by D.R. Morrison, "PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric", Journal of the ACM,
15, pp. 514-534 (1968). A sparse suffix tree is defined as follows.

Definition 4: Sparse suffix tree

[0037] A sparse suffix tree SST(T) of the text T is a suffix tree containing only a subset of the suffixes present in the
suffix tree ST(T) of the text.

[0038] When the search system according to the present invention is used for searching for entire words, a non-evenly
spaced sparse suffix tree may advantageously be created by storing suffixes starting at word boundaries only. The concept
of a word-spaced sparse suffix tree is defined as follows.

Definition 5: Word-spaced sparse suffix tree 

[0039] A word-spaced sparse suffix tree SST_ws(T) of a text T is a sparse suffix tree SST(T) containing only the
suffixes starting at a word separator character in the text.

[0040] Fig. 2 shows two examples of word-spaced sparse suffix trees. Parts of the suffixes have been omitted to
enhance the readability. The word-spaced sparse suffix tree for T = "to be the best" is the left structure, and that for
T = "to make the only major modification" is the right structure in fig. 2.
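
Which suffixes a word-spaced sparse suffix tree keeps can be shown with a short sketch. It only enumerates the suffixes and does not build the tree. The function name `word_spaced_suffixes`, the default separator set, and the reading that a stored suffix begins at the first character of each word are assumptions made for this illustration.

```python
def word_spaced_suffixes(text: str, separators: str = " \t\n") -> list:
    """Return (start position, suffix) pairs for suffixes that begin at a word start."""
    suffixes = []
    at_boundary = True  # the beginning of the text counts as a word boundary
    for pos, ch in enumerate(text):
        if ch in separators:
            at_boundary = True
        elif at_boundary:
            suffixes.append((pos, text[pos:]))
            at_boundary = False
    return suffixes


if __name__ == "__main__":
    for pos, suf in word_spaced_suffixes("to be the best"):
        print(pos, repr(suf))
```

For the 14-character text "to be the best" a full suffix tree represents 14 suffixes, whereas the word-spaced variant keeps only the four that begin a word.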

[0041] In the search system of the present invention the text is naturally divided into words which are stored inde- 
pendently in the word-spaced sparse suffix tree. 

[0042] As the atomic search term for searching is the word itself, advantageously each suffix will be terminated at 
the end of the word. This reduces the sparse suffix tree to a so-called PATRICIA trie (Morrison, op.cit). A trie as defined 
in the literature is a rooted tree with the properties that each node, except the root, contains a symbol of the alphabet 
and that no two children of the same node contain the same symbol. It should be noted that the word trie derives from 
the word "retrieval" and hence indicates that the trie is a tree structure suitable for retrieval of data. A PATRICIA trie is 
defined as a keyword-spaced sparse suffix tree (KWS tree) where the suffixes stored in the leaf nodes are limited by 
keyword delimiters. An example of a PATRICIA trie for the set of keywords {"avoid", "abuse", "be", "become", "breathe", 
"say"} is shown in fig. 3. The structure used in the search system of the present invention differs from the PATRICIA 
trie because the search system explicitly stores sequence information of the words. Reducing the suffix length requires 
that the representation of the leaf node is changed. Pointers to the original text are replaced by the suffix string itself. 
A suffix length reduction of this kind is shown in fig. 4 for one of the strings shown in fig. 2. In other words fig. 4 shows 
the word-spaced sparse suffix tree for T = "to make the only major modification" and with suffixes cut off at word 
boundaries. A leaf node will contain a list of all positions where the word represented by the leaf node occurs. 
[0043] Instead of using the implicit sequence information found in the original text, the present invention explicitly
stores sequence information in the word-spaced sparse suffix tree. This is done by using pointers between the leaf
nodes that represent consecutive words in the original text. As all the occurrences of the word represented by
a particular leaf node are available in that node, each occurrence is given a pointer to the leaf of the next consecutive word.

[0044] A leaf node contains only the suffix of the word it represents, so when traversing the sequence pointers in
the occurrence list only the suffixes of each of the consecutive words are revealed. This is handled by storing the entire
word in the leaf node instead of just the suffix, and thus the data structure of the invention also differs from the PATRICIA
trie in this respect. The data structure for explicitly stored word sequence information, with an occurrence list with
pointers to the next consecutive word and to its occurrence, is shown in fig. 5.
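
The leaf nodes with explicit sequence information can be sketched as follows. This is an illustrative simplification under stated assumptions: a plain dictionary stands in for the sparse suffix tree that would normally locate the leaves, words are split on whitespace, and the names `Occurrence`, `Leaf`, `build_leaves` and `follow` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Occurrence:
    position: int                       # index of the word in the text
    next_leaf: Optional["Leaf"] = None  # leaf of the next consecutive word
    next_index: Optional[int] = None    # entry in next_leaf.occurrences


@dataclass
class Leaf:
    word: str                           # the entire word, not only its suffix
    occurrences: list = field(default_factory=list)


def build_leaves(text: str) -> dict:
    """Create one leaf per distinct word and link consecutive occurrences."""
    leaves = {}
    previous = None
    for position, word in enumerate(text.split()):
        leaf = leaves.setdefault(word, Leaf(word))
        leaf.occurrences.append(Occurrence(position))
        if previous is not None:
            previous.next_leaf = leaf
            previous.next_index = len(leaf.occurrences) - 1
        previous = leaf.occurrences[-1]
    return leaves


def follow(leaf: Leaf, index: int, count: int) -> list:
    """Read up to `count` consecutive words starting at one occurrence."""
    words = []
    while leaf is not None and len(words) < count:
        words.append(leaf.word)
        occurrence = leaf.occurrences[index]
        leaf, index = occurrence.next_leaf, occurrence.next_index
    return words


if __name__ == "__main__":
    leaves = build_leaves("to be or not to be")
    print(follow(leaves["be"], 0, 3))
```

Because every leaf stores the whole word, following the pointers reproduces the original word sequence without consulting the text.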
[0045] The search system according to the present invention uses a PATRICIA trie for organizing the occurrence list
(Morrison, op.cit.). The PATRICIA trie enables the search system to access the list of all consecutive words matching
the string p_2 in a time O(|p_2|), where |p_2| of course is the length of p_2. By using a PATRICIA trie to organize the list of
occurrences, a completely defined tree structure is obtained for storing words from a text and maintaining the sequence
information. A typical leaf node, with both a PATRICIA trie for the organized occurrence list and the extra unsorted list
of occurrences, is shown in fig. 6. As an example of the memory requirement for an occurrence list as used in the search
system of the present invention, consider a database with about 742 358 documents, a total of 333 856 744 words and a
lexicon of 538 244 distinct words. The total size of the database is 2054.52 MB. The average word length is thus 6.45
bytes. A sparse suffix tree will use 8 bytes for each internal node, using 32 bit pointers. It is assumed that an average
of 3 internal nodes is used for each word. The leaf node would then require 6.45 bytes for storing the entire word plus
32 bits for a pointer to an occurrence list. A total of 34.45 bytes/word gives a total size of 18.108 MB. In addition the
occurrence list has the size of 4 bytes per entry, or 12 bytes if the full version is to be used. Hence the total memory
requirement of the occurrence list varies from 1273 MB to 3820 MB. The data structure using a sparse suffix tree will
have a size between 60% and 200% of the original text. This is comparable with the requirements of an inverted file, but
the sparse suffix tree as used in the search system according to the invention provides much faster searching, enables
approximate matching and makes sequence matching easy to perform.
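
The arithmetic behind these figures can be retraced with a few lines of Python. The sketch assumes that "MB" means 1024 × 1024 bytes, which reproduces the quoted average word length and occurrence list sizes; the function and key names are hypothetical. With the same convention the tree itself comes out near 18 MB, the same order as the figure quoted above.

```python
MIB = 1024 * 1024  # assumption: the text's "MB" is 2**20 bytes


def memory_estimate() -> dict:
    total_words = 333_856_744
    distinct_words = 538_244
    database_mib = 2054.52

    avg_word_len = database_mib * MIB / total_words
    # 3 internal nodes of 8 bytes, the word itself, and a 32-bit pointer.
    bytes_per_lexicon_word = 3 * 8 + 6.45 + 4
    tree_mib = bytes_per_lexicon_word * distinct_words / MIB
    occ_min_mib = 4 * total_words / MIB    # 4 bytes per occurrence entry
    occ_max_mib = 12 * total_words / MIB   # 12 bytes per entry, full version
    return {
        "avg_word_len": avg_word_len,
        "bytes_per_lexicon_word": bytes_per_lexicon_word,
        "tree_mib": tree_mib,
        "occ_min_mib": occ_min_mib,
        "occ_max_mib": occ_max_mib,
        "ratio_min": (tree_mib + occ_min_mib) / database_mib,
        "ratio_max": (tree_mib + occ_max_mib) / database_mib,
    }


if __name__ == "__main__":
    for name, value in memory_estimate().items():
        print(f"{name}: {value:.2f}")
```

The two ratios land at roughly 0.63 and 1.87 of the original text size, in line with the stated range of 60% to 200%.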

[0046] In approximate searching, a metric is used to give an error measure of a possible match. The search system 
according to the present invention employs several metrics, and particularly a unique combination of metrics. These 
metrics along with the combined metric shall be discussed in the following. 

[0047] An edit distance metric as defined above allows the operations deletion, insertion and change which intuitively 
apply to words as well as characters. Common errors in matching phrases are missing, extra or changed words. Hence 
the edit distance metric as previously defined shall be adapted and extended in order to apply to the approximate word 
sequence matching problem. Edit operations for sequences are defined below. 

Definition 6: Edit operations for sequences 

[0048] For transforming one sequence S of words into another sequence P of words, the edit operations allowed on
the words in the sequences may be written according to the following rewrite rules:

• (a→ε), deletion of word a from the sequence

• (ε→a), insertion of word a into the sequence

• (a→b), change of word a into word b

• (ab→ba), transposition of adjacent words a and b.

[0049] Instead of characters as atoms, the search system according to the invention applies the edit operations to 
words which then should be regarded as the operational atoms. 
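[0050] A cost function c(x→y) is a constant which is defined as

$$c(x \rightarrow y) = \begin{cases} 1 & \text{delete} \\ 1 & \text{insert} \\ 1 & \text{transpose} \\ d(x,y) & \text{change} \end{cases} \qquad (2)$$

where d(x,y) is defined as

$$d(x,y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{otherwise} \end{cases} \qquad (3)$$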

[0051] By using the edit operations as defined above the edit distance for sequences can now be defined. 
Definition 7: Edit distance for sequences

[0052] The edit distance metric for sequences defines the distance D_seq(S,P) between the sequence S = s_1, s_2, ..., s_n
and the sequence P = p_1, p_2, ..., p_m as the minimum sum of costs c(x→y) for the sequence of edit operations
transforming the sequence S into the sequence P.
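
A word-level edit distance with the four operations of Definition 6 can be sketched with dynamic programming. The sketch assumes unit costs for deletion, insertion and transposition and a 0/1 cost for change, and it uses the restricted form in which a transposed pair of adjacent words is not edited further; both are assumptions of this illustration, as is the function name `sequence_edit_distance`.

```python
def sequence_edit_distance(S: list, P: list) -> int:
    """Minimum number of word-level edits (delete, insert, change,
    transpose adjacent words) that transform sequence S into sequence P."""
    n, m = len(S), len(P)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete the first i words of S
    for j in range(m + 1):
        D[0][j] = j  # insert the first j words of P
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = 0 if S[i - 1] == P[j - 1] else 1
            best = min(
                D[i - 1][j] + 1,           # deletion of a word
                D[i][j - 1] + 1,           # insertion of a word
                D[i - 1][j - 1] + change,  # change of a word
            )
            if i > 1 and j > 1 and S[i - 1] == P[j - 2] and S[i - 2] == P[j - 1]:
                best = min(best, D[i - 2][j - 2] + 1)  # transposition
            D[i][j] = best
    return D[n][m]


if __name__ == "__main__":
    print(sequence_edit_distance("to be the best".split(), "to the be best".split()))
```

Words are the atoms here: two words either agree or not, regardless of how many characters they share.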

[0053] The search system according to the present invention enhances the edit distance metric for sequences to
weight the cost of the edit operations by the size of the words operated upon.

Definition 8: Word size-dependent edit distance for sequences 

[0054] The word size dependent edit distance for sequences is defined as the minimum sum of costs for the editing 
operations needed to transform one sequence into the other. The cost functions are dependent on the word size of 
their operands. 

[0055] In the search system according to the invention a definition of cost functions is given by the equations

[Equation (4): word-size dependent cost functions for the delete, insert, transpose and change operations. Only the
fragment "... × max(|a|, ..." of the change case is legible in the source.]

where l denotes the average length of a word in the two sequences being compared. The cost of each edit operation
is weighted by a value proportional to the change in the total length of the sequence or by the ratio of the current word
length and the average word length in the sequences considered.

[0056] Now the distance metric reflects the assumption of some relation between the word length and how important 
the word is to the semantic context of the word sequence. Furthermore the search system according to the invention 
employs proximity at the character level when the change edit operation (a → b) is used. Replacing a word a by another
word b should be related to the similarity between these two words. The new cost function for the change edit operation 
hence is given as: 



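[Equation (5): cost function for the change edit operation (a → b), expressed in terms of d_approx(a, b); not legible in the source]

where

\[
d_{approx}(a, b) = D(a, b) \tag{6}
\]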

[0057] Here D(a,b) is the normalized edit distance function for words, where 0 means full similarity and 1 means
no similarity.
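
A normalized word distance with the stated range can be sketched as below. The normalization used here, dividing the character-level edit distance by the length of the longer word, is an assumption; the text only states that 0 means full similarity and 1 means no similarity. Both function names are hypothetical.

```python
def char_edit_distance(a, b):
    """Unit-cost edit distance between two words, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # change or keep
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Assumed normalization: edit distance divided by the longer word length."""
    longest = max(len(a), len(b))
    return char_edit_distance(a, b) / longest if longest else 0.0
```

A change operation between two near-identical words, such as a misspelling, then receives a value close to 0, while a change between unrelated words approaches 1.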

[0058] The search system according to the invention combines the edit distance metric for sequences with the cost
functions as given by formulas (4), (5) and (6), with an edit distance metric for words as given by formula (1). This
means that sequence edit operations are only used when the words being matched are more than k errors away from
each other.

[0059] The algorithms used in the search system according to the invention perform efficient searching of the
described structures. Matches are found according to the metrics as given above.
[0060] Approximate word matching in a word-spaced sparse suffix tree is done by combining the calculation of the
edit distance matrix and a traversal of the suffix tree. An algorithm for this is written in pseudo-code and given in table I.
[0061] This algorithm is adapted from a trie-matching algorithm as proposed by H. Shang & T.H. Merrett, "Tries for
Approximate String Matching", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 4, pp. 540-547
(1996). The expected worst case running time of the algorithm is O(k|Σ|^k) according to Shang & Merrett (op.cit.).
[0062] Approximate word sequence matching requires the calculation of the word sequence edit distance for all
possible matches. However, the number of possible matches can be limited by starting the calculation of the edit
distance only on the possible start words. The cost of deleting a word from the sequence determines the number of
possible start words. If the accumulated cost of deleting the first i words in a query sequence P_Q rises above a given
error threshold, the candidate sequence starting with the i-th word of the query cannot possibly be a match. Therefore,
for a query sequence P_Q of l words, at most l possible start words will be tried. Since there are no backpointers in the
sequence structure of the tree, it cannot be ensured that all possible matches are obtained. Adding backpointers would
solve this problem. The algorithm for approximate word sequence matching as used in the search system according
to the present invention is given in pseudo-code in table II below. This algorithm tries to match the first keyword with
p_1, p_2, ... sequentially, testing all possible start positions.

[0063] In the ApproxSequenceMatch algorithm in table II the ApproxMatchRest function is defined by the algorithm
in table III below. This function matches the remaining sequence, using an initial error value.

Table I

FindApproximate(root, p, k)
    node ← root;
    i ← 1;
    nodes ← Children(node);              // A stack of nodes
    for all v ∈ nodes do
        if IsLeaf(v) then
            for j ← i to length(Suffix(node)) do
                w_j ← Suffix(node)_{j-i};
                if w_j = '$' then        // is a stopchar
                    output;
                    return;
                if EditDist(i) = ∞ then
                    break;
        else                             // Internal node
            i ← i+1;
            w_i ← label(v)
            if EditDist(i) = ∞ then
                break;
            nodes ← Children(v) ∪ nodes;
    // end for

EditDistance(j)                          // Calculates jth row
    for i ← 1 to length(P) do
        if p_i = w_j then d ← 0 else d ← 1;
        c_1 = D[i-1, j] + c_ins(w_j);
        c_2 = D[i, j-1] + c_del(p_i);
        c_3 = D[i-1, j-1] + c_change(p_i, w_j);
        c_4 = D[i-2, j-2] + c_transpose(p_i, w_{j-1});
        D[i, j] ← c_fraction(j/l) · min(c_1, c_2, c_3, c_4);
        if D[i, j] > k
            return ∞;                    // No distance below k
    return D[i, j]
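
The combination of tree traversal and row-wise edit distance calculation in table I can be illustrated with the following Python sketch. It is a simplified stand-in under stated assumptions: a plain character trie built from nested dictionaries replaces the word-spaced sparse suffix tree, all costs are unit costs, transposition and the c_fraction weighting are omitted, and the names `build_trie` and `find_approximate` are hypothetical. Stored words are assumed not to contain the end marker '$'.

```python
def build_trie(words):
    """Nested-dict trie; the key '$' marks the end of a stored word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w
    return root


def find_approximate(root, pattern, k):
    """Return (word, distance) pairs for stored words within k edits of pattern."""
    results = []
    first_row = list(range(len(pattern) + 1))
    # Explicit stack of (node, edge character, edit distance row of the parent).
    stack = [(child, ch, first_row) for ch, child in root.items() if ch != "$"]
    while stack:
        node, ch, prev_row = stack.pop()
        row = [prev_row[0] + 1]
        for i in range(1, len(pattern) + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,           # pattern character unmatched
                           prev_row[i] + 1,          # extra character in the trie
                           prev_row[i - 1] + cost))  # change or exact match
        if "$" in node and row[-1] <= k:
            results.append((node["$"], row[-1]))
        # Prune the subtree once every entry of the row exceeds k.
        if min(row) <= k:
            stack.extend((child, c, row) for c, child in node.items() if c != "$")
    return sorted(results)
```

Each trie edge adds one row to the distance matrix, so words sharing a prefix share the rows computed for that prefix, and a subtree is abandoned as soon as no entry of the current row is within the threshold k.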






EP 1 095 326 B1 



Table II

ApproxSequenceMatch_ED(root, P (= p_1, p_2, ..., p_m), k)
    m ← |P|
    matches ← ∅
    startError ← 0
    startIndex ← 1
    while startError < k OR startIndex ≤ m do
        startNode ← FindExact(p_startIndex);
        list ← UnorderedOccurrenceList(startNode);
        for all v ∈ list do
            if ApproxMatchRest(v, P, k, startError) then
                matches ← matches ∪ v;
        startError ← startError + c_del(p_startIndex);
        startIndex ← startIndex + 1;






Table III

ApproxMatchRest(u, P, k, startError)
    error ← startError;
    lastError ← startError;
    column ← 0;
    node ← u;
    for v ← p_2 to p_|P| do
        node ← NextOccurrence(node);
        word ← Keyword(node);
        lastError ← error;
        error ← startError + EditDistance(column);
        if error > k AND lastError > k then
            return false;
    return true;

EditDistance(j)                          // Calculates jth row
    for i ← 1 to length(P) do
        if p_i = w_j then d ← 0 else d ← 1;
        c_1 = D[i-1, j] + c_ins(w_j);
        c_2 = D[i, j-1] + c_del(p_i);
        c_3 = D[i-1, j-1] + c_change(p_i, w_j);
        c_4 = D[i-2, j-2] + c_transpose(p_i, w_{j-1});
        D[i, j] ← c_fraction(j/l) · min(c_1, c_2, c_3, c_4);
        if D[i, j] > k
            return ∞;                    // No distance below k
    return D[i, j]
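
The interplay of tables II and III can be sketched in Python as follows. Several simplifications are assumptions of this sketch: the text is a plain list of words, a dictionary of word positions stands in for FindExact and the occurrence list, all edit costs are 1 rather than word-size dependent, and the loop condition uses AND with `<=` so that it also terminates at the end of the query and works for k = 0. The names `best_prefix_distance` and `approx_sequence_match` are hypothetical.

```python
def best_prefix_distance(window, rest):
    """Smallest unit-cost word edit distance between rest and any prefix of window."""
    prev = list(range(len(window) + 1))
    for i, q in enumerate(rest, 1):
        cur = [i]
        for j, w in enumerate(window, 1):
            cur.append(min(prev[j] + 1,              # query word has no counterpart
                           cur[j - 1] + 1,           # extra word in the text
                           prev[j - 1] + (q != w)))  # change or exact match
        prev = cur
    return min(prev)


def approx_sequence_match(text_words, query, k):
    """Start positions in text_words where query matches with at most k word errors."""
    occurrences = {}
    for pos, w in enumerate(text_words):
        occurrences.setdefault(w, []).append(pos)
    matches = set()
    start_error = 0      # accumulated cost of the query words skipped so far
    start_index = 0
    while start_error <= k and start_index < len(query):
        rest = query[start_index + 1:]
        for pos in occurrences.get(query[start_index], []):
            # Words following the occurrence, long enough to absorb k extra words.
            window = text_words[pos + 1: pos + 1 + len(rest) + k]
            if start_error + best_prefix_distance(window, rest) <= k:
                matches.add(pos)
        start_error += 1                 # unit cost for deleting query[start_index]
        start_index += 1
    return sorted(matches)
```

The start word itself is located exactly, as in table II, and each later start word is tried only while the accumulated deletion cost of the skipped query words stays within the threshold.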



[0064] The algorithms in tables II and III are written in the same pseudo-code as the algorithm in table I. 
[0065] The FindExact function used to find the leaf node matching the first word in the sequence performs a simple
traversal of the tree and its running time is O(|p_1|), where p_1 denotes the first word in a query sequence P_Q. Calculating
the edit distance can be done in |P|^2 time using straightforward dynamic programming or in O(k) time (where k denotes
the error threshold) using improved versions of the calculation algorithm, see E. Ukkonen: "Finding Approximate
Patterns in Strings", Journal of Algorithms, vol. 6, pp. 132-137 (1985).

[0066] If Σ n(p_i) denotes the total sum of the number of occurrences of each word p_i in the word sequence, then
the worst case running time is O(k Σ n(p_i)).

[0067] Finally the implementation of a search engine based on the search system according to the invention shall
briefly be discussed. Particularly a search engine based on the search system according to the invention is implemented
as an approximate search engine (ASE) and is intended as a search engine for indexing large document collections
and providing algorithms for exact and approximate searching of these document collections. The ASE shall provide a
data structure for storing large texts or collections of documents. It is to be understood that the data structure may be
generated from documents which contain additional information, such as images, video and sound, and the text may be
formatted and/or annotated. The data structure is identical to the word-spaced sparse suffix tree as discussed above and
it is, of course, to be understood that the words are the keywords of the search system, hence the word-spaced sparse
suffix tree may instead be termed a keyword-spaced sparse suffix tree (KWS tree). The ASE shall contain algorithms
for indexing documents in the KWS tree. These algorithms, of course, do not form a part of the search system according
to the present invention, but they are well-known to persons skilled in the art and described in the literature, see for
instance J. Kärkkäinen & E. Ukkonen (op.cit.) and D.R. Morrison (op.cit.).

[0068] The search system according to the invention and as used in the ASE employs algorithms both for exact and
approximate matching of a pattern in a KWS tree. The algorithms given above in table I and table II are used for
approximate word and word sequence matching with the non-uniform edit distance as a metric. Finding an exact match
of keyword p with length m in a KWS tree is known in the art and easily implemented as a simple traversal of the tree
structure. An appropriate algorithm for exact keyword matching written in pseudo-code is given in table IV. The search
system according to the invention shall also be able to support algorithms for exact keyword sequence matching.
Algorithms for exact keyword sequence matching are known in the art and easily implemented as e.g. shown in
pseudo-code in table V below. The algorithm given here will find the exact match of the first keyword, if any. Then, for all
occurrences of the first keyword, it will check whether the keyword that follows matches the second keyword of the
query. If so, the MatchRest procedure in table V is used to determine whether the occurrence of the first two keywords
matches the entire sequence. For approximate keyword matching in a KWS tree the search system implements the
algorithm in table I above. For approximate keyword sequence matching the search system implements the algorithm
in table II above, matching a first keyword sequentially with p_1, p_2, ... and testing all possible start positions, applying
the ApproxMatchRest function as given in table III to match a sequence starting at a particular position and handle the
initial error value.

[0069] Finally, the ASE shall need a simple front end which gives the user control of indexing and querying the 
document collection. The front end should also be able to furnish statistics of the document collection and provide both 
a network interface for remote access, e.g. via WWW, and a local server user interface. 

[0070] The ASE with the search system according to the invention should be general in a manner that allows new
indexing and searching algorithms to be added easily. Also, storing extra information about each document or keyword
shall be possible to implement in an easy manner. Particularly the front end should be independent of the data structure
and the search algorithms, such that internal changes in these have no effect on the design of the former.
[0071] When using the search system according to the invention, the ASE should be designed to have as low a memory
overhead as possible in the data structure. Also, searching should be designed to be as fast as possible. However,
there will usually be a trade-off between these two factors.

Table IV

FindExact(root, p)
    i ← 1;
    node ← FindChild(root, p_i);
    while node AND i < length(p) do
        if IsLeaf(node) AND Suffix(node) = p_i...p_m then
            return node;
        i ← i+1
        node ← FindChild(node, p_i);
    return NIL;

Table V

MatchSequenceExact(P, root)
    matches ← ∅;
    v ← FindExact(p_1, root);
    if |P| > 1 then
        if v ≠ NIL then
            list ← UnorderedOccurrenceList(v);
            for all u ∈ list do
                if NextKeyword(u) = p_2 then
                    if MatchRest(p_3...p_m, u) then
                        matches ← matches ∪ Occurrence(u);
    return matches;

MatchRest(P, u)
    node ← u;
    for v ← p_1 to p_|P| do
        node ← NextOccurrence(node);
        word ← Keyword(node);
        if v ≠ word then
            return false;
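
Exact keyword sequence matching as in tables IV and V can be sketched in Python under simplifying assumptions: a dictionary from keyword to text positions stands in for FindExact and the unordered occurrence list of the KWS tree, and following NextOccurrence pointers is modelled by reading the next words of the text. The names `build_occurrences` and `match_sequence_exact` are hypothetical.

```python
def build_occurrences(text_words):
    """Map each keyword to the list of positions where it occurs in the text."""
    index = {}
    for pos, w in enumerate(text_words):
        index.setdefault(w, []).append(pos)
    return index


def match_sequence_exact(text_words, index, query):
    """Positions where the keyword sequence query occurs exactly in text_words."""
    matches = []
    if not query:
        return matches
    # Look up the first keyword, then verify the rest at every occurrence.
    for pos in index.get(query[0], []):
        if text_words[pos:pos + len(query)] == list(query):
            matches.append(pos)
    return matches
```

For example, indexing the text "to be or not to be" and querying for the sequence "to be" yields the two start positions 0 and 4.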



[0072] To sum up, an ASE with a search system according to the invention shall comprise four major modules.

1. Document indexing module DIM for indexing documents in the KWS tree structure. This module should also
contain all extensions to support several document types.

2. Data storage module DSM based on a keyword-spaced sparse suffix tree (KWS tree).

3. Search algorithm module SAM for searching the KWS tree, comprising algorithms for exact and/or approximate
matching of respectively words and word sequences.

4. User interface front-end module FEM comprising both a local server user interface and a network interface for
remote queries.

[0073] The four modules of the ASE work together to offer a complete search engine functionality. The data flow
between the different modules is shown in fig. 7. Indexing a collection of documents is done in the document indexing
module DIM comprising indexing algorithms. This module is, of course, not a part of the search system according to
the invention, but indexing algorithms that can be used are well-known in the art. The text found in the documents is
passed on to the data storage module DSM for storage. The data storage module is, of course, a part of the search
system according to the invention and is as stated based on the KWS tree structure. The search algorithm module
SAM contains algorithms for searching the data located in the data storage module. This module implements the search
system according to the present invention and allows for a search process querying the data structure for tree and
node information, while maintaining state variables. The front-end module may for instance be implemented on a
workstation or a personal computer and the like, providing the functionality as stated above.
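
The division into four modules and the data flow between them can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, a dictionary of postings stands in for the KWS tree, and only exact keyword lookup is shown.

```python
class DataStorageModule:
    """DSM stand-in: maps each keyword to (document id, word position) pairs."""

    def __init__(self):
        self.postings = {}

    def add(self, keyword, doc_id, position):
        self.postings.setdefault(keyword, []).append((doc_id, position))

    def occurrences(self, keyword):
        return self.postings.get(keyword, [])


class DocumentIndexingModule:
    """DIM stand-in: splits document text into keywords and passes them to storage."""

    def __init__(self, storage):
        self.storage = storage

    def index(self, doc_id, text):
        for position, keyword in enumerate(text.lower().split()):
            self.storage.add(keyword, doc_id, position)


class SearchAlgorithmModule:
    """SAM stand-in: queries the storage module; only exact lookup is sketched."""

    def __init__(self, storage):
        self.storage = storage

    def exact(self, keyword):
        hits = self.storage.occurrences(keyword.lower())
        return sorted({doc_id for doc_id, _ in hits})


class FrontEndModule:
    """FEM stand-in: gives the user control of indexing, querying and statistics."""

    def __init__(self):
        self.storage = DataStorageModule()
        self.indexer = DocumentIndexingModule(self.storage)
        self.searcher = SearchAlgorithmModule(self.storage)

    def add_document(self, doc_id, text):
        self.indexer.index(doc_id, text)

    def query(self, keyword):
        return self.searcher.exact(keyword)

    def statistics(self):
        return {"keywords": len(self.storage.postings)}
```

The front end only talks to the indexing and search modules, so the storage class could be replaced by another data structure without changing the front end, in line with the independence described above.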

[0074] As already stated in the introduction, it is to be understood that the search system according to the invention
can be implemented as software written in a suitable high-level language on commercially available computer systems,
including workstations. It may also, as stated, be implemented in the form of a dedicated processor device which
advantageously may comprise a large number of parallel processors being able to process large word sequences in
parallel for approximate matching with a large number of query word sequences. The fixed operational parameters of
the processor may then be entered in a low-level code, while keyword sequences input from the KWS tree structure
allow for an extremely fast processing of queries on a huge amount of data, and the search system according to the
present invention shall hence be highly suited for performing searches on e.g. the World Wide Web, even in
a KWS tree structure large enough to index all documents presently offered on the World Wide Web and moreover
capable of handling the expected data volume growth on the World Wide Web in the future.



Claims 

1. A search system for information retrieval, particularly information stored in form of text, wherein a text T comprises
words and/or symbols s and sequences S thereof, wherein the information retrieval takes place with a given or
varying degree of matching between a query Q, wherein the query Q comprises words and/or symbols q and
sequences P thereof, and retrieved information R comprising words and/or symbols and sequences thereof from
the text T, wherein the search system comprises a data structure for storing at least a part of the text T, and a
metric M which measures the degree of matching between the query Q and retrieved information R, and wherein
the search system implements search algorithms for executing a search, particularly a full text search on the basis
of keywords kw, characterized in that the data structure comprises a tree structure in the form of a non-evenly
spaced sparse suffix tree ST(T) for storing suffixes of words and/or symbols s and sequences S thereof in the text
T, that the metric M comprises a combination of an edit distance metric D(s,q) for an approximate degree of
matching between words and/or symbols s;q in respectively the text T and a query Q and an edit distance metric
D_ws(S,P) for an approximate degree of matching between sequences S of words and/or symbols s in the text T and a
query sequence P of words and/or symbols q in the query Q, the latter edit distance metric including weighting
cost functions for edit operations which transform sequences of words and/or symbols s in the text T into the
sequence P of words and/or symbols q in the query Q, the weighting taking place with a value proportional to a
change in the length of the sequence S upon a transformation or dependent on the size of the words and/or symbols
s;q in sequences S;P to be matched, that the implemented search algorithms comprise a first algorithm for
determining the degree of matching between words and/or symbols s;q in the suffix tree representation of respectively
the text T and a query Q, and a second algorithm for determining the degree of matching between sequences S;P
of words and/or symbols s;q in the suffix tree representation of respectively the text T and the query Q, said first
and/or second algorithms searching the data structure with queries Q in the form of either words, symbols,
sequences of words or sequences of symbols or combinations thereof, such that information R is retrieved on the
basis of query Q with a specified degree of matching between the former and the latter, and that the search
algorithms optionally also comprise a third algorithm for determining exact matching between words and/or symbols
s;q in the suffix tree representation of respectively the text T and the query Q and/or a fourth algorithm for
determining exact matching between sequences S;P of words and/or symbols s;q in the suffix tree representation of
respectively the text T and the query Q, said third and/or fourth algorithms searching the data structure with queries
Q in the form of either words, symbols, sequences of words, or sequences of symbols or combinations thereof
such that information R is retrieved on the basis of the query Q with an exact matching between the former and
the latter.

2. A search system according to claim 1, characterized in that the non-evenly spaced sparse suffix tree ST(T) is a
word-spaced sparse suffix tree SST_ws(T) comprising only a subset of the suffixes in the text T.

3. A search system according to claim 2, characterized in that the word-spaced sparse suffix tree SST_ws(T) is a
keyword-spaced sparse suffix tree SST_kws(T).

4. A search system according to claim 3, characterized in that the first algorithm for detecting the degree of keyword
matching in a keyword-spaced sparse suffix tree SST_kws(T) is implemented in pseudo-code as follows:

FindApproximate(root, p, k)
    node ← root;
    i ← 1;
    nodes ← Children(node);              // A stack of nodes
    for all v ∈ nodes do
        if IsLeaf(v) then
            for j ← i to length(Suffix(node)) do
                w_j ← Suffix(node)_{j-i};
                if w_j = '$' then        // is a stopchar
                    output;
                    return;
                if EditDist(i) = ∞ then
                    break;
        else                             // Internal node
            i ← i+1;
            w_i ← label(v)
            if EditDist(i) = ∞ then
                break;
            nodes ← Children(v) ∪ nodes;
    // end for

EditDistance(j)                          // Calculates jth row
    for i ← 1 to length(P) do
        if p_i = w_j then d ← 0 else d ← 1;
        c_1 = D[i-1, j] + c_ins(w_j);
        c_2 = D[i, j-1] + c_del(p_i);
        c_3 = D[i-1, j-1] + c_change(p_i, w_j);
        c_4 = D[i-2, j-2] + c_transpose(p_i, w_{j-1});
        D[i, j] ← c_fraction(j/l) · min(c_1, c_2, c_3, c_4);
        if D[i, j] > k
            return ∞;                    // No distance below k
    return D[i, j]

5. A search system according to claim 3, characterized in that the second algorithm for determining the degree of
keyword sequence matching in a keyword-spaced sparse suffix tree SST_kws(T) is implemented in pseudo-code
as follows:

ApproxSequenceMatch_ED(root, P (= p_1, p_2, ..., p_m), k)
    m ← |P|
    matches ← ∅
    startError ← 0
    startIndex ← 1
    while startError < k OR startIndex ≤ m do
        startNode ← FindExact(p_startIndex);
        list ← UnorderedOccurrenceList(startNode);
        for all v ∈ list do
            if ApproxMatchRest(v, P, k, startError) then
                matches ← matches ∪ v;
        startError ← startError + c_del(p_startIndex);
        startIndex ← startIndex + 1;

6. A search system according to claim 5, characterized in that the ApproxMatchRest subroutine of the second
algorithm is implemented in pseudo-code as follows:

ApproxMatchRest(u, P, k, startError)
    error ← startError;
    lastError ← startError;
    column ← 0;
    node ← u;
    for v ← p_2 to p_|P| do
        node ← NextOccurrence(node);
        word ← Keyword(node);
        lastError ← error;
        error ← startError + EditDistance(column);
        if error > k AND lastError > k then
            return false;
    return true;

EditDistance(j)                          // Calculates jth row
    for i ← 1 to length(P) do
        if p_i = w_j then d ← 0 else d ← 1;
        c_1 = D[i-1, j] + c_ins(w_j);
        c_2 = D[i, j-1] + c_del(p_i);
        c_3 = D[i-1, j-1] + c_change(p_i, w_j);
        c_4 = D[i-2, j-2] + c_transpose(p_i, w_{j-1});
        D[i, j] ← c_fraction(j/l) · min(c_1, c_2, c_3, c_4);
        if D[i, j] > k
            return ∞;                    // No distance below k
    return D[i, j]
7. A search system according to claim 3, characterized in that the third algorithm for determining an exact keyword
matching in a keyword-spaced suffix tree SST_kws(T) is implemented in pseudo-code as follows:

FindExact(root, p)
    i ← 1;
    node ← FindChild(root, p_i);
    while node AND i < length(p) do
        if IsLeaf(node) AND Suffix(node) = p_i...p_m then
            return node;
        i ← i+1
        node ← FindChild(node, p_i);
    return NIL;

8. A search system according to one of claims 3 to 7, characterized in that the fourth algorithm for determining an
exact keyword sequence matching in a keyword-spaced sparse suffix tree SST_kws(T) is implemented in pseudo-code
as follows:

MatchSequenceExact(P, root)
    matches ← ∅;
    v ← FindExact(p_1, root);
    if |P| > 1 then
        if v ≠ NIL then
            list ← UnorderedOccurrenceList(v);
            for all u ∈ list do
                if NextKeyword(u) = p_2 then
                    if MatchRest(p_3...p_m, u) then
                        matches ← matches ∪ Occurrence(u);
    return matches;

MatchRest(P, u)
    node ← u;
    for v ← p_1 to p_|P| do
        node ← NextOccurrence(node);
        word ← Keyword(node);
        if v ≠ word then
            return false;

9. A method in a search system for information retrieval, particularly information stored in form of text, wherein a text
T comprises words and/or symbols s and sequences S thereof, wherein the information retrieval takes place with
a given or varying degree of matching between a query Q, wherein the query Q comprises words and/or symbols
q and sequences P thereof, and retrieved information R comprising words and/or symbols and sequences thereof
from the text T, wherein the search system comprises a data structure for storing at least a part of the text T, and
a metric M which measures the degree of matching between the query Q and retrieved information R, and wherein
the search system implements search algorithms for executing a search, particularly a full text search on the basis
of keywords kw, wherein the information in the text T is divided into words s and word sequences S, the words
being substrings of the entire text separated by word boundary terms and forming a sequence of symbols, and
wherein each word is structured as a sequence of symbols, characterized by generating the data structure as a
word-spaced sparse suffix tree SST_ws(T) of a text T for representing all the suffixes starting at a word separator
symbol in the text T; storing sequence information of the words s in the text T in the word-spaced sparse suffix
tree SST_ws(T), generating a combined edit distance metric M comprising an edit distance metric D(s,q) for words
s in the text T and a query word q in a query Q and a word-size dependent edit distance metric D_ws(S,P) for
sequences S of words s in the text T and a sequence P of words q in the query Q, the edit distance metric D_ws(S,P)
being the minimum sum of costs for edit operations transforming a sequence S into the sequence P, the minimum
sum of costs being the minimum sum of cost functions for each edit operation weighted by a value proportional to
the change in the total length of the sequence S or by the ratio of the current word length and average word length
in the sequences S;P, and determining the degree of matching between words s,q by calculating the edit distance
D(s,q) between the words s of the retrieved information R and the word q of a query Q, or in case the words s,q
are more than k errors from each other, determining the degree of matching between the word sequences S_R, P_Q
of retrieved information R and a query Q respectively by calculating the edit distance D_ws(S_R, P_Q) for all matches.

10. The method according to claim 9, characterized by additionally weighting an edit operation which changes a word
s into word q with a parameter for the proximity between the characters of the words s;q, thus taking the similarity
of the words s;q in regard when determining the cost of the edit operation in question.

11. The method according to claim 9, characterized by limiting the number of matches by calculating the edit distance
D_ws(S_R, P_Q) for a restricted number of words in the query word sequence P_Q.

12. A method according to claim 9, characterized by defining the edit distance D(s,q) between words s and a word
q recursively and calculating the edit distance D(s,q) by means of a dynamic programming procedure.

13. A method according to claim 9, characterized by defining the edit distance D_ws(S,P) between sequences S and
a sequence P recursively and calculating the edit distance D_ws(S,P) by means of a dynamic programming procedure.

14. The use of a search system according to claim 1 in an approximate search engine.



Patentanspriiche 



Suchsystem fur die Informationswiedergewinnung, insbesondere fiir Information, die in Form von Text gespeichert 
ist, wobe. ein Text TWorter und/oder Symbole s und Sequenzen S hieraus umfaBt, wobei die Informationswieder- 
gewinnung mit emem gegebenen oder variierenden Grad von Ubereinstimmung zwischen einer Anfrage O und 
wiedergewonnener Information R ablauft, wobei die Anfrage O Worter und/oder Symbole q und Sequenzen P 
hieraus umfaBt und wobei die Information R Worter und/oder Symbole und Sequenzen hieraus aus dem Text T 
umfaBt, wobei das Suchsystem eine Datenstruktur zum Speichern wenigstens eines Teils des Textes T und ein 
MaB M umfaBt, welches den Grad der Ubereinstimmung zwischen der Anfrage O und der wiedergewonnenen 
Information R miBt, und wobei das Suchsystem einen Suchalgorithmus zur Ausfuhrung einer Suche realisiert 
insbesondere einer Volltextsuche auf der Basis von Schlusselwortern kw, dadurch gekennzeichnet daB die 
Datenstruktur eine Baumstruktur in der Form eines nicht gleichmaBigen verteilten dunnbesiedelten Suffix-Baums 
ST(7) zum Speichern von Suffixen von Wortern und/oder Syrtibolen s und Sequenzen S daraus in dem Text T 
umfaBt, daB das MaB M eine Kombination aus einem Edit-AbstandsmaB D(s,g) fur einen ungefahren Grad an 
Ubereinstimmung zwischen Wortern und/oder Symbolen s;q in dem Text Tbzw. einer Anfrage O und ein Edit- 
AbstandsmaB D^{S,P) fur einen ungefahren Grad an Ubereinstimmung zwischen Sequenzen Svon Wortern und/ 
Oder Symbolen s in dem Text Tund einer Anfragesequenz P von Wortern und/oder Symbolen q in der Anfrage Q 
umfaBt, wooer das zuletzt genannte Edit-AbstandsmaB eine Gewicht-Kostenfunktion fur Editieroperationen um- 
faBt, welche Sequenzen von Wortern und/oder Symbolen s in dem Text Tin die Sequenz P von Wortern und/oder 
Symbolen q in der Anfrage Q transformiert, wobei die Gewichtung mit einem Wert erfolgt. der proportional zu einer 
Anderung der Lange der Sequenz S bei einer Transformation oder abhangig von der GroBe der Worter und/oder 
Symbole s;q in den abzugleichenden Sequenzen S;P ist, daB der implementierte Suchalgorithmus einen ersten 
Algorrthmus zum Enmitteln des Grades der Ubereinstimmung zwischen Wortern und/oder Symbolen s q in der 
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Suffix-Baumdarstellung des Textes Tbzw. einer Anfrage O und einen zweiten Algorithmic zum Ermitteln des 
Grades der Ubereinstimmung zwischen Sequenzen S;P von .Wortern und/oder Symbolen s;q in der Suffix- 
Baumdarstellung des Textes Tbzw. der Anfrage O umfaBt, wobei der erste und/oder zweite Algorithmus die Da- 
tenstrukturen mit Anfragen Oin der Form von entweder Wortern, Symbolen, Wortsequenzen oderSymbolsequen- 
zen oder Kombinationen daraus absucht, so daB Information Raut der Basis der Anfrage Omit einem bestimmten 
Grad an Ubereinstimmung zwischen ersterer und letzterer wiedergewonnen wird, und daB der Suchalgorithmus 
optional auch einen dritten Algorithmus zum Ermitteln der exakten Ubereinstimmung zwischen Wortern und/oder 
Symbolen s;q in der Suffix-Baumdarstellung des Textes Tbzw. der Anfrage O und/oder einen vierten Algorithmus 
zum Ermitteln der exakten Ubereinstimmung zwischen Sequenzen S;Pvon Wortern und/oder Symbolen s;qin der 
Suffix-Baumdarstellung des Textes Tbzw. der Anfrage Q umfaBt, wobei der dritte und/oder vierte Algorithmus die 
Datenstruktur mit Anfragen Oin der Form von entweder Wortern, Symbolen, Wortsequenzen oderSymbolsequen- 
zen oder Kombinationen daraus absuchen, so daB Information R auf der Basis der Anfrage Q mit einer exakten 
Ubereinstimmung zwischen ersterer und letzerer wiedergewonnen wird. 

Suchsystem nach Anspruch 1 , dadurch gekennzeichnet, daB der nicht gleichmaBig verteifte dunnbesiedelte 
Suffix-Baum ST (7) ein diinnbesiedelter Wortabstand-Suffix-Baum SST WS (T) ist, der nur eine Untermenge der 
Suffixe in dem Text T umfaBt. 

Suchsystem nach Anspruch 2, dadurch gekennzeichnet, daB der dunnbesiedelte Wortabstand-Suffix-Baum 
SST ws( T ) ein dunnbesidelter Stichwort-Abstand-Suffix-Baum SST^IT) ist. 

Suchsystem nach Anspruch 3, dadurch gekennzeichnet, daB der erste Algorithmus zum Erfassen des Grads 
der Stichwortubereinstimmung in einem dunnbesiedelten Stichwort-Abstand-Suffix-Baum SST^T) in Pseu- 
do-Code wie folgt implementiert wird: 

FindApproximate ( root,p,k) 
node <r- root; 
i +- 1; 

nodes <- Children (node) ; 1 1 A stack of nodes 
for all v g nodes do 

if IsLeaf (v) then 
for j <r- i to length (Suffix (node) ) do 
Wj Suffix (node) j-i; 
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if Wj = »$' then 
Output 

return; 



// is a stopchar 



if EditDist (i) 
break; 



oo then 



else 



//Internal node 



i <r- i+1; 

<- label (v) 
if EditDist (i) = oo then 
break; 

nodes ^- Children (v) □ nodes; 
// end for 

EditDistance(j) // Calculates jth row 

for i<- 1 to length (P) do 

if pi = Wj then d <- 0 else d <- 1; 
ci = D[i-l,j] + c ins (mj); 
c 2 = D[i,j-1] + c del (pi); 
c 3 = £>[i-l,j-l] + c change (pi,mj\; 
c 4 = D[i~2,j-2] + c tr<3nS pose(Pir tfj-j) ; 

<- c fractlon { - min(Ci, c 2 , c 3 , c 4 } ; 
if P[i'j] > * 
return oo; // No distance below /c 

return D[i,j] 



Suchsystem nach Anspruch 3, dadurch gekennzeichnet, daB der zweite Algorithmus zum Bestimmen des Grads 
der Obereinstimmung einer Stichwortsequenz in einem dunnbesiedelten Stichwort-Abstand-Suffx-Baum SST,^ 
(7) in Pseudo-Code wie folgt implementiert wird: 



ApproxSequenceMatch_ED (root, P(=Pi,p 2 ,...,P m ) ,k) 
m<- |p| 
matches <- 0 
startError 0 
startlndex <— 1 

while startError < k OR startlndex < m do 
startNode <- FindExact (p s tartm<u>x) / 
list <- UnorderedOccurrenceList (startNode) ; 
for all v e lists do 
if ApproxMatchRest (v,P,k, startError) then 
matches <- □ v; 

startError <- startError + c del (Pstsrtmdex) ; 
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startlndex startlndex + 1; 

Suchsystem nach Anspruch 7, dadurch gekennzeichnet, daB die Unterroutine ApproxMatchRest des zwerten 
Algorithmus in Pseudo-Code wie folgt implementiert wird: 

ApproxMatchRest ( u, P, K, startError) 

error «— startError/ 

lastError <- startError; 

column <- O ; 

node <- u; 
for v <r- p 2 to pjpi do 
node <- NextOccurrence (node) ; 
word f- Keyword (node) ; 
lastError «- error/ 

error <— startError + EditDistance (column;; 
if error > k AND lastError > k then 
return false; 

return true; 

EditDistance(j) // Calculates jth row 
for i<- 1 to length (P) do 

if pi = then 3 «- 0 else 5 <- 1; 

Ci = D[i-l f j] + CinstJWj); 

c 2 = D[i,j-l] + c de i(Pi); 

C 4 = D[i-2,j~2] + C t ranspose(Pif Wj-i) ; 

^[i/j] <- c fraction (j/l)- min(c lf c 2/ c 3f c 4 ) ; 
if > * 

return oo; // No distance below Jc 

return D[i,j] 

7. Search system according to claim 3, characterized in that the third algorithm for determining the exact
keyword matching in a keyword-spaced sparse suffix tree SST_KWS(T) is implemented in pseudo-code as follows:

FindExact(root, p)
    i <- 1;
    node <- FindChild(root, p_i);
    while node AND i < length(p) do
        if IsLeaf(node) AND Suffix(node) = p_i...p_m then
            return node;
        i <- i+1;
        node <- FindChild(node, p_i);
    return NIL;
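As an illustration of the exact keyword lookup, the following Python sketch descends a character trie one symbol at a time and returns the node only when a stored keyword ends there. It assumes an uncompressed trie instead of a suffix tree with leaf suffixes; Node, build_keyword_trie and find_exact are hypothetical names.

```python
class Node:
    """Trie node; `occurrences` is non-empty only where a stored keyword ends."""

    def __init__(self):
        self.children = {}     # symbol -> Node
        self.occurrences = []  # word positions of the keyword in the text


def build_keyword_trie(words):
    root = Node()
    for position, word in enumerate(words):
        node = root
        for symbol in word:
            node = node.children.setdefault(symbol, Node())
        node.occurrences.append(position)
    return root


def find_exact(root, p):
    """Return the node for keyword p, or None (NIL) if p is not stored."""
    node = root
    for symbol in p:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node if node.occurrences else None
```

For the word list "a suffix tree is a tree", find_exact(root, "tree") returns a node whose occurrence list is [2, 5], while the prefix "tre" yields None.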

8. Search system according to any one of claims 3 to 7, characterized in that the fourth algorithm for
determining the exact matching of a keyword sequence in a keyword-spaced sparse suffix tree SST_KWS(T) is
implemented in pseudo-code as follows:

MatchSequenceExact(P, root)
    matches <- ∅;
    v <- FindExact(p_1, root);
    if |P| > 1 then
        if v ≠ NIL then
            list <- UnorderedOccurrenceList(v);
            for all u ∈ list do
                if NextKeyword(u) = p_2 then
                    if MatchRest(p_3...p_m, u) then
                        matches <- matches ∪ Occurrence(u);
    return matches;

MatchRest(P, u)
    node <- u;
    for v <- p_1 to p_|P| do
        node <- NextOccurrence(node);
        word <- Keyword(node);
        if v ≠ word then
            return false;
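The exact sequence match can be sketched in Python by looking up the occurrences of the first query word and then following the text word by word. The sketch assumes that the occurrence list is a dictionary from keyword to word positions and that "next occurrence" simply means the next position in the word list. All function names are hypothetical.

```python
from collections import defaultdict


def build_occurrences(words):
    """Map each keyword to the word positions at which it occurs."""
    occ = defaultdict(list)
    for pos, word in enumerate(words):
        occ[word].append(pos)
    return occ


def match_rest(words, P, u):
    """True if the query words P occur consecutively in the text from position u."""
    for offset, q in enumerate(P):
        pos = u + offset
        if pos >= len(words) or words[pos] != q:
            return False
    return True


def match_sequence_exact(words, occ, P):
    """Return the start positions of all exact occurrences of the sequence P."""
    return [u for u in occ.get(P[0], []) if match_rest(words, P[1:], u + 1)]
```

Only the occurrences of the first query word are visited, so the cost depends on how often that word appears and not on the length of the text.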

9. A method in a search system for information retrieval, particularly of information stored in the form of
text, wherein a text T comprises words and/or symbols s and sequences S thereof, wherein the information
retrieval takes place with a given or varying degree of matching between a query Q and retrieved information R,
wherein the query Q comprises words and/or symbols q and sequences P thereof, and wherein the information R
comprises words and/or symbols and sequences thereof from the text T, wherein the search system comprises a data
structure for storing at least a part of the text T and a metric M which measures the degree of matching between
the query Q and the retrieved information R, and wherein the search system implements a search algorithm for
executing a search, particularly a full text search on the basis of keywords kw, wherein the information in the
text T is divided into words s and word sequences S, the words being substrings of the entire text which are
separated by word boundary terms and form a sequence of symbols, and wherein each word is structured as a
sequence of symbols, characterized by:

generating the data structure as a word-spaced sparse suffix tree SST_WS(T) of a text T for representing all
suffixes that start at a word separator symbol in the text T;

storing sequence information of the words s in the text T in the word-spaced sparse suffix tree SST_WS(T);

generating a combined edit distance metric M which comprises an edit distance metric D(s,q) for words s in the
text T and a query word q in a query Q, and a word-size-dependent edit distance metric D_WS(S,P) for word
sequences S in the text T and a word sequence P in the query Q, wherein the edit distance metric D_WS(S,P) is
the minimum sum of the costs of edit operations which transform a sequence S into the sequence P, wherein the
minimum sum of the costs is the minimum sum of the cost functions for each edit operation, weighted with a value
proportional to the change in the total length of the sequence S, or with the ratio of the current word length
to the average word length in the sequences S, P; and

determining the degree of matching between words s, q by calculating the edit distance metric D(s,q) between
the words s of the retrieved information R and the words q of a query Q, or, in case the words s, q differ from
each other by more than k errors, determining the degree of matching between the word sequences S_R, P_Q of the
retrieved information R and of a query Q respectively by calculating the edit distance metric D_WS(S_R, P_Q) for
all matches.
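One possible reading of the word-size-dependent metric D_WS(S,P) is sketched below in Python. It assumes that every edit operation on a word is weighted by the ratio of that word's length to the average word length of S and P taken together, and that a change is charged at the larger of the two weights. The claim leaves the cost functions open, so these choices and the name sequence_distance are illustrative assumptions only.

```python
def sequence_distance(S, P):
    """Word-level edit distance turning the word list S into the word list P.
    Each operation costs len(word) / average word length (assumed weighting).
    At least one of S and P must be non-empty."""
    all_words = S + P
    avg = sum(len(x) for x in all_words) / len(all_words)

    def weight(word):
        return len(word) / avg

    D = [[0.0] * (len(P) + 1) for _ in range(len(S) + 1)]
    for i in range(1, len(S) + 1):
        D[i][0] = D[i - 1][0] + weight(S[i - 1])          # delete S[i-1]
    for j in range(1, len(P) + 1):
        D[0][j] = D[0][j - 1] + weight(P[j - 1])          # insert P[j-1]
    for i in range(1, len(S) + 1):
        for j in range(1, len(P) + 1):
            if S[i - 1] == P[j - 1]:
                change = 0.0
            else:
                change = max(weight(S[i - 1]), weight(P[j - 1]))
            D[i][j] = min(D[i - 1][j] + weight(S[i - 1]),
                          D[i][j - 1] + weight(P[j - 1]),
                          D[i - 1][j - 1] + change)
    return D[-1][-1]
```

With this weighting, dropping a short word such as "the" from a sequence costs less than dropping a long word such as "retrieval".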

10. Method according to claim 9, characterized by additionally weighting an edit operation which changes a word
s into a word q with a parameter for the proximity between the characters of the words s, q, thereby taking the
similarity of the words s, q into account when determining the cost of the edit operation in question.
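A proximity parameter of this kind could, for example, be derived from the character-level edit distance between the two words. The Python sketch below normalizes that distance by the length of the longer word. The normalization and the names char_distance and change_cost are assumptions made for illustration and are not prescribed by the claim.

```python
def char_distance(s, q):
    """Unit-cost character edit distance between the words s and q."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def change_cost(s, q):
    """Cost between 0 and 1 of changing word s into word q; 0 for identical words."""
    longest = max(len(s), len(q))
    return char_distance(s, q) / longest if longest else 0.0
```

Changing "colour" into "color" then costs 1/6, whereas changing it into an unrelated word of the same length costs 1.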

11. Method according to claim 9, characterized by limiting the number of matches by calculating the edit
distance metric D_WS(S_R, P_Q) for a restricted number of words in the query word sequence P_Q.

12. Method according to claim 9, characterized by defining the edit distance metric D(s,q) between words s and
a word q recursively and calculating the edit distance metric D(s,q) by means of a dynamic programming procedure.

13. Method according to claim 9, characterized by defining the edit distance metric D_WS(S,P) between sequences
S and a sequence P recursively and calculating the edit distance metric D_WS(S,P) by means of a dynamic
programming procedure.

14. Use of the search system according to claim 1 in an approximate search engine.



Claims

1. A search system for information retrieval, particularly of information stored in the form of text, wherein
a text T comprises words and/or symbols s and sequences S thereof, wherein the information retrieval takes place
with a given or varying degree of matching between a query Q, wherein the query Q comprises words and/or symbols
q and sequences P thereof, and retrieved information R comprising words and/or symbols and sequences thereof
from the text T, wherein the search system comprises a data structure for storing at least a part of the text T,
and a metric M which measures the degree of matching between the query Q and the retrieved information R, and
wherein the search system implements search algorithms for executing a search, particularly a full text search
on the basis of keywords kw, characterized in that:

the data structure comprises a tree structure in the form of a non-evenly spaced sparse suffix tree ST(T) for
storing suffixes of words and/or symbols s and sequences S thereof in the text T;

the metric M comprises a combination of an edit distance metric D(s,q) giving an approximate degree of matching
between words and/or symbols s, q in the text T and a query Q respectively, and an edit distance metric D_WS(S,P)
giving an approximate degree of matching between sequences S of words and/or symbols s in the text T and a query
sequence P of words and/or symbols q in the query Q, the latter edit distance metric comprising weighting cost
functions for edit operations which transform sequences S of words and/or symbols s in the text T into sequences
P of words and/or symbols q in the query Q, the weighting taking place with a value which is proportional to a
change in the length of the sequence S upon a transformation, or which depends on the size of the words and/or
symbols s, q in the sequences S, P to be compared;

the implemented search algorithms comprise a first algorithm for determining the degree of matching between words
and/or symbols s, q in the suffix tree representation of the text T and a query Q respectively, and a second
algorithm for determining the degree of matching between sequences S, P of words and/or symbols s, q in the
suffix tree representation of the text T and the query Q respectively, said first and/or second algorithms
searching the data structure with queries Q in the form of words, symbols, sequences of words or sequences of
symbols or combinations thereof, such that information R is retrieved on the basis of a query Q with a specified
degree of matching between the information and the query; and

the search algorithms optionally also comprise a third algorithm for determining an exact matching between words
and/or symbols s, q in the suffix tree representation of the text T and the query Q respectively, and/or a fourth
algorithm for determining an exact matching between sequences S, P of words and/or symbols s, q in the suffix
tree representation of the text T and the query Q, the third and/or fourth algorithms searching the data structure
with the queries Q in the form of words, symbols, sequences of words, or sequences of symbols or combinations
thereof, such that information R is retrieved on the basis of the query Q with an exact matching between the
information and the query.

2. Search system according to claim 1, characterized in that the non-evenly spaced sparse suffix tree ST(T) is a
word-spaced sparse suffix tree SST_WS(T) comprising only a subset of the suffixes of the text T.

3. Search system according to claim 2, characterized in that the word-spaced sparse suffix tree SST_WS(T) is a
keyword-spaced sparse suffix tree SST_KWS(T).

4. Search system according to claim 3, characterized in that the first algorithm for determining the degree of
matching of keywords in a keyword-spaced sparse suffix tree SST_KWS(T) is implemented by the following
pseudo-code:

FindApproximate(root, p, k)
    node <- root;
    i <- 1;
    nodes <- Children(node); // A stack of nodes
    for all v ∈ nodes do
        if isLeaf(v) then
            for j <- i to length(Suffix(node)) do
                w_j <- Suffix(node)_(j-i);
                if w_j = '$' then // is a stopchar
                    Output w_1...w_j;
                    return;
                if EditDist(i) = ∞ then
                    break;
        else // Internal node
            i <- i+1;
            w_i <- label(v);
            if EditDist(i) = ∞ then
                break;
            nodes <- Children(v) ∪ nodes;
    // end for

EditDistance(j) // Calculates jth row
    for i <- 1 to length(P) do
        if p_i = w_j then d <- 0 else d <- 1;
        c_1 = D[i-1, j] + c_ins(w_j);
        c_2 = D[i, j-1] + c_del(p_i);
        c_3 = D[i-1, j-1] + c_change(p_i, w_j);
        c_4 = D[i-2, j-2] + c_transpose(p_i, w_j-1);
        D[i, j] <- c_fraction(j/l) · min(c_1, c_2, c_3, c_4);
        if D[i, j] > k
            return ∞; // No distance below k
    return D[i, j]

5. Search system according to claim 3, characterized in that the second algorithm for determining the degree of
matching of keyword sequences in a keyword-spaced sparse suffix tree SST_KWS(T) is implemented by the following
pseudo-code:

ApproxSequenceMatch_ED(root, P (= p_1, p_2, ..., p_m), k)
    m <- |P|
    matches <- ∅
    startError <- 0
    startIndex <- 1

    while startError < k OR startIndex < m do
        startNode <- FindExact(p_startIndex);
        list <- UnorderedOccurrenceList(startNode);
        for all v ∈ list do
            if ApproxMatchRest(v, P, k, startError) then
                matches <- matches ∪ v;

        startError <- startError + c_del(p_startIndex);
        startIndex <- startIndex + 1;
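The idea behind ApproxSequenceMatch_ED, trying each query word in turn as an exact anchor and charging a deletion cost for every query word skipped, can be sketched in Python as follows. The sketch assumes unit costs on whole words, a dictionary as occurrence list, and a loop that stops once the accumulated start error exceeds k. These are assumptions about the intended behaviour, and the names word_edit_distance and approx_sequence_match are hypothetical.

```python
from collections import defaultdict


def word_edit_distance(a, b):
    """Unit-cost edit distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def approx_sequence_match(words, P, k):
    """Return {text position: error} for text windows matching P within k word edits."""
    occ = defaultdict(list)
    for pos, w in enumerate(words):
        occ[w].append(pos)
    matches = {}
    start_error = 0  # cost of the query words skipped so far
    for start_index, anchor in enumerate(P):
        if start_error > k:
            break
        rest = P[start_index:]
        for v in occ.get(anchor, []):
            # Compare the rest of the query with text windows starting at v.
            best = min(word_edit_distance(rest, words[v:v + n])
                       for n in range(max(1, len(rest) - k), len(rest) + k + 1))
            error = start_error + best
            if error <= k and error < matches.get(v, k + 1):
                matches[v] = error
        start_error += 1  # unit deletion cost for giving up on this anchor
    return matches
```

Because the anchor word must occur exactly, a query whose first word is misspelled is still found through a later anchor, at the price of one deletion per skipped word.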

6. Search system according to claim 5, characterized in that the subroutine ApproxMatchRest of the second
algorithm is implemented by the following pseudo-code:






ApproxMatchRest(u, P, k, startError)
    error <- startError;
    lastError <- startError;
    column <- 0;
    node <- u;
    for v <- p_2 to p_|P| do
        node <- NextOccurrence(node);
        word <- Keyword(node);
        lastError <- error;
        error <- startError + EditDistance(column);
        if error > k AND lastError > k then
            return false;
    return true;

EditDistance(j) // Calculates jth row
    for i <- 1 to length(P) do
        if p_i = w_j then d <- 0 else d <- 1;
        c_1 = D[i-1, j] + c_ins(w_j);
        c_2 = D[i, j-1] + c_del(p_i);
        c_3 = D[i-1, j-1] + c_change(p_i, w_j);
        c_4 = D[i-2, j-2] + c_transpose(p_i, w_j-1);
        D[i, j] <- c_fraction(j/l) · min(c_1, c_2, c_3, c_4);
        if D[i, j] > k
            return ∞; // No distance below k
    return D[i, j]

7. Search system according to claim 3, characterized in that the third algorithm for determining an exact
matching of keywords in a keyword-spaced sparse suffix tree SST_KWS(T) is implemented by the following
pseudo-code:

FindExact(root, p)
    i <- 1;
    node <- FindChild(root, p_i);
    while node AND i < length(p) do
        if IsLeaf(node) AND Suffix(node) = p_i...p_m then
            return node;
        i <- i+1;
        node <- FindChild(node, p_i);
    return NIL;

8. Search system according to any one of claims 3 to 7, characterized in that the fourth algorithm for
determining an exact matching of keyword sequences in a keyword-spaced sparse suffix tree SST_KWS(T) is
implemented by the following pseudo-code:

MatchSequenceExact(P, root)
    matches <- ∅;
    v <- FindExact(p_1, root);
    if |P| > 1 then
        if v ≠ NIL then
            list <- UnorderedOccurrenceList(v);
            for all u ∈ list do
                if NextKeyword(u) = p_2 then
                    if MatchRest(p_3...p_m, u) then
                        matches <- matches ∪ Occurrence(u);
    return matches;

MatchRest(P, u)
    node <- u;
    for v <- p_1 to p_|P| do
        node <- NextOccurrence(node);
        word <- Keyword(node);
        if v ≠ word then
            return false;

9. A method for information retrieval in a search system, particularly of information stored in the form of
text, wherein a text T comprises words and/or symbols s and sequences S thereof, wherein the information
retrieval takes place with a given or varying degree of matching between a query Q, wherein the query Q comprises
words and/or symbols q and/or sequences P thereof, and the retrieved information R comprises words and/or symbols
and sequences thereof from the text T, wherein the search system comprises a data structure for storing at least
a part of the text T, and a metric M which measures the degree of matching between the query Q and the retrieved
information R, and wherein the search system implements search algorithms for executing a search, particularly a
full text search on the basis of keywords kw, wherein the information in the text T is divided into words s and
word sequences S, the words being substrings of the entire text separated by word boundary terms and forming a
sequence of symbols, and wherein each word is structured as a sequence of symbols, characterized in that it
comprises the steps of:

generating the data structure in the form of a word-spaced sparse suffix tree SST_WS(T) of a text T for
representing all suffixes starting at a word separator symbol in the text T;

storing sequence information of the words s of the text T in the word-spaced sparse suffix tree SST_WS(T);

generating a combined edit distance metric M comprising an edit distance metric D(s,q) between words s in the
text T and a query word q in a query Q, and a word-size-dependent edit distance metric D_WS(S,P) between
sequences S of words s in the text T and a sequence P of words q in the query Q, the edit distance metric
D_WS(S,P) being the minimum sum of the costs of edit operations transforming a sequence S into the sequence P,
the minimum sum of the costs being the minimum sum of the cost functions for each edit operation, weighted with
a value proportional to the change in the total length of the sequence S, or with the ratio of the current word
length to the average word length in the sequences S, P; and

determining the degree of matching between words s, q by calculating the edit distance D(s,q) between the words
s of the retrieved information R and the word q of a query Q, or, in case the words s, q are more than k errors
apart from each other, determining the degree of matching between word sequences S_R, P_Q of the retrieved
information R and of a query Q respectively by calculating the edit distance D_WS(S_R, P_Q) for all matches.
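The first two steps, keeping only the suffixes that begin at a word boundary and recording where each word sequence occurs, can be sketched in Python. The sketch assumes whitespace as the only word separator and uses an uncompressed word-level trie, which needs quadratic space and is meant only to show which suffixes are indexed. The names word_suffixes and build_sparse_suffix_trie are hypothetical.

```python
import re


def word_suffixes(text):
    """Yield (offset, suffix) for every suffix of text that starts at a word."""
    for m in re.finditer(r"\S+", text):
        yield m.start(), text[m.start():]


def build_sparse_suffix_trie(text):
    """Word-level trie over the word-boundary suffixes of text.
    Each node records the word positions of the suffixes passing through it."""
    words = text.split()
    root = {"children": {}, "positions": []}
    for start in range(len(words)):
        node = root
        for word in words[start:]:
            node = node["children"].setdefault(
                word, {"children": {}, "positions": []})
            node["positions"].append(start)
    return root
```

A text of n words contributes only n suffixes here, compared with one suffix per character in an ordinary suffix tree.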

10. Method according to claim 9, characterized in that it further comprises the step of weighting an edit
operation which changes a word s into a word q with a parameter for the proximity between the characters of the
words s, q, thereby taking the similarity of the words s, q into account when determining the cost of the edit
operation in question.

11. Method according to claim 9, characterized in that it comprises the step of limiting the number of matches by
calculating the edit distance D_WS(S_R, P_Q) for a restricted number of words in the query word sequence P_Q.

12. Method according to claim 9, characterized in that it comprises the step of defining the edit distance D(s,q)
between words s and a word q recursively and calculating the edit distance D(s,q) by means of a dynamic
programming procedure.

13. Method according to claim 9, characterized in that it comprises the step of defining an edit distance
D_WS(S,P) between sequences S and a sequence P recursively and calculating the edit distance D_WS(S,P) by means
of a dynamic programming procedure.

14. Method consisting of using a search system according to claim 1 in an approximate search engine.
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